Method of indexing and retrieval of electronically-stored documents

ABSTRACT

A document indexing and retrieval system and method which assigns weights to the key words and assigns a relative value to pairs of key words (i.e. defines a relative relation on K×K) based on their frequency of occurrence and co-occurrence in the document data base. In response to a query both the weights and this relative relation are used to suggest additional and/or alternative key words which are very likely to find relevant documents. Documents are then ranked by number of hits adjusted for the weights of hit words and their relative values.

This is a continuation of application Ser. No. 97/998,023, filed Dec. 29, 1992, which is a continuation-in-part of U.S. application Ser. No. 07/456,558, filed Dec. 26, 1989, both now abandoned.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates generally to document storage and retrieval systems and more particularly to a method of indexing documents so that they can be retrieved in response to a query in order of their relevance to the query. It also permits a general query to be easily modified based on the content of the documents so that the new query will retrieve documents that are relevant to the original query.

2. Description of the Prior Art

Document retrieval based on indexing of the documents in a document database is well known. Typically the documents are indexed by creating an index file which records the documents that each word is in. Then, when the user inputs a query, the documents that contain one or more words of the query can be quickly identified. However, if the query consists of general words that are not terms of art, the query may produce unsatisfactory retrieval results by either producing few documents that are of interest to the user or producing many documents that are not interesting to the user, or both.

SUMMARY OF THE INVENTION

A principal object of the present invention is to provide an improved method of indexing and retrieving documents which:

(A) allows a user to easily modify his query based on the content of the documents so that the new query will retrieve documents that are of interest to the user;

(B) accurately ranks the documents in order of relevance to the query; and

(C) allows the user to peruse the documents extremely quickly.

Another object of the present invention is to use the Soft Boolean Connector concept to adjust the number of hits (i.e., the number of query words that a document is credited with for ranking purposes) by giving less than a full hit to a word that often co-occurs with other query words. Another object of the present invention is to use the Soft Boolean Connector concept to adjust the number of hits (i.e., the number of query words that a word is credited with by virtue of its being related to those query words) for a possible suggested word by giving less than a full hit to a word that often co-occurs with the other query words.

These objects, as well as other objects which will become apparent from the discussion that follows, are achieved according to the present invention by the following steps (note: in the following, the words "term" and "keyword" stand for both a single word and a phrase consisting of a group of words, e.g., "patent application"):

1. Indexing the documents by creating index files of which documents contain each term, how many times the term appears in the document, and how many documents each term appears in.

2. Assigning as many weights to each term as there are documents that contain that term, where the weight of a term in a document depends on the number of times the term appears in the document, the number of documents that the term appears in, and the total number of terms in the document.

3. Constructing for each term a ranked list of companions of said term, which list contains the terms (companions) that appear in the same documents as said term in order of the sum of the weights of the companions over all documents that contain both the term and the companion. Associated with each companion is the companion percentage, which is the sum used to rank the companions.

4. Using the companion lists to construct relative lists for each term, which relative lists usually contain only those companions which also have said term as a companion. Associated with each relative is the relative percentage, which is a weighted average of the companion's percentage as a companion of the term and the term's companion percentage as a companion of the companion. The relative percentages are used to rank the relatives.

5. Assigning a "polysemantic" weight to each term, which polysemantic weight depends on the number of documents that the term is in, the number of relatives that the term has, and the relative strength of the first few relatives to the other relatives.

6. Presenting to the user, in response to a query, a list of "SWAPS" (Synthetic Word Association Pattern Search) terms that are the best relatives to the entire group of terms contained in the query and allowing the user to add one or more of the presented terms to the query.

7. Ranking the documents according to how many query terms are contained in the document, their polysemantic weights and their weights in the documents.

The present invention facilitates the rapid searching of a document database for documents that are of interest to the user. By using the suggested SWAPS terms the user can modify his query so as to retrieve those documents, if they exist in the data base, which are of interest. Since the SWAPS terms that are presented are in many of the documents that the original query terms are in, adding them to the query is guaranteed to retrieve those documents and others containing the SWAPS terms. By using the SWAPS feature repeatedly the user can in effect roam around the data base without actually retrieving and reading documents. Only after the query has been modified to include all the interesting SWAPS terms does the user need to actually retrieve the documents. The user can start with a poor query and modify it using SWAPS so that it becomes a good query. The user need not waste time formulating a good query that will not retrieve any relevant documents because there happen to be no such documents in the data base. The SWAPS terms that are suggested will always retrieve documents that contain them, i.e., documents that are likely to be relevant.

The ranking of the documents also facilitates rapid searching because the user can be confident that the highest ranked documents will be the documents that are most relevant to the query and that all documents which have any relevance will be retrieved and ranked.

The foregoing and other objects, features and advantages of the present invention will become apparent from the following, more particular description of the preferred embodiments of the invention, as illustrated in the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a computer system embodying the present invention;

FIG. 2 is a view of the display screen showing an entered query and the result of parsing it;

FIG. 3 is a view of the display screen showing suggested SWAPS terms for the query of FIG. 2;

FIG. 4 is a view of the display screen showing the modified query;

FIG. 5 is a view of the display screen showing suggested SWAPS terms for the modified query of FIG. 4;

FIG. 6 is a view of the display screen showing a second modification of the query based on choosing SWAPS terms from FIG. 5;

FIG. 7 is a view of the display screen as a result of ranking the documents for the query of FIG. 6;

FIG. 8 is an operational flow diagram for indexing a set of documents;

FIG. 9 is a procedure tree for the QSEARCH program used for searching an indexed set of documents using the SWAPS and RANKING features;

FIGS. 10A to 10J are descriptions of the program modules in FIG. 8;

FIG. 11 is a description of the program modules in FIG. 9; and

FIGS. 12A to 12C are descriptions of the ABSTRACT program module.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

This invention will now be described as embodied in a computer system of the type shown in FIG. 1. This embodiment utilizes the following computer hardware and software:

(1) IBM compatible personal computer with at least 4 MB of RAM, a large capacity hard drive, a display screen, and a keyboard.

(2) MS-DOS compatible operating system and LIM 3.2 compatible expanded memory manager.

(3) A vocabulary file of terms (words and phrases).

(4) A series of programs that index the documents by constructing various files that hold information about which terms are in which documents, which documents contain which terms, the weights of the terms, and which terms are relatives of other terms by virtue of occurring in the same documents and how strongly they are related.

(5) A user program that accepts a query, suggests modifications to the query, and ranks the documents based on the modified query using the weights and relative strengths of the terms of the query.

The Vocabulary file is structured as a list of headwords, each with a short synonym list. All of the synonyms of a given headword are assigned the same code.

The full list of indexing programs can be found in FIG. 10. Here we will describe the most important of these programs: AIM, AIMPASS2, FREQCOMP, RELATIVE, and POLYSEMY.

The first indexing program is AIM.BAS: Automatic Indexing Module. It creates DocKeys, DocIndex, and IDF. DocKeys holds all of the Keywords and Keyword-Counts for all documents. IDF holds the document frequency, i.e., the number of documents a keyword appears in.
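The two structures named above can be illustrated with a minimal Python sketch. It is not the AIM.BAS code: the function name `build_index`, the in-memory dictionaries, and the simple whitespace tokenizer are assumptions made for illustration, and only single-word keywords are handled (no phrases, synonym codes, or case and prefix rules).

```python
from collections import Counter


def build_index(documents, vocabulary):
    """Count keywords per document and documents per keyword.

    documents:  {document number: text}           (hypothetical input shape)
    vocabulary: set of lower-case single-word keywords
    """
    doc_keys = {}         # plays the role of DocKeys: doc -> {keyword: count}
    doc_freq = Counter()  # plays the role of IDF: keyword -> number of documents
    for doc_number, text in documents.items():
        # Naive tokenizer (an assumption): split on whitespace, strip punctuation.
        words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
        counts = Counter(w for w in words if w in vocabulary)
        doc_keys[doc_number] = dict(counts)
        doc_freq.update(counts.keys())  # each document counts once per keyword
    return doc_keys, dict(doc_freq)
```

For example, indexing two short documents against the vocabulary {"contract", "statute", "enforced"} yields a per-document count table and a document-frequency table that later steps can use when computing weights.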

As the words in the documents are checked against the vocabulary to see if they are keywords, the case (upper or lower) is possibly changed and they are stripped of prefixes to see if the different case or stem is a keyword according to the following algorithms (UC=upper-case and LC=lower-case):

IF UC word is at the beginning of a sentence AND we don't have it in our vocabulary as a LC word THEN look for it in the Vocabulary as an UC word

IF UC word in middle of sentence AND we don't have it UC THEN look for it if it doesn't have a typical proper name ending

In USER Program Only: IF word NOT found THEN find both the stem AND find the Good prefix

(In the following "find" means that the stem and/or prefix is said to be in the document if the prefix is of the right type and the stem has the indicated length and is a keyword.)

IF GOOD prefix THEN

    Find GOOD prefix if stem > 3 characters long

    IF word is found THEN find if stem > 8 characters long

    IF word is NOT found THEN find if stem > 5 characters long

IF POOR prefix THEN

    IF word is found THEN DON'T find stem

    IF word is NOT found THEN find if stem > 5 characters long

List of Poor Prefixes:

hi, co, de, en, ex, im, in, un, re, con, eco, dis, epi, mal, mid, mis, non, off, out, pre, pan, sub, uni, demi, down, fore, hemi, high, meta, over, para, peri, post, self, semi, after, inter, quasi, trans, under

List of Good Prefixes:

air, bio, sea, sky, top, aero, anti, auto, back, head, home, homo, hemo, mega, mini, mono, rear, poly, self, tele, viro, chemo, ferro, homeo, hyper, infra, intra, macro, micro, multi, hydro, radio, super, supra, ultra, contra, hetero, thermo, techno, nucleo, counter, electro, magneto
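The prefix rules above leave some room for interpretation, so the following Python sketch shows only one possible reading of them. The function names, the shortened sample prefix lists, and the choice to try the longest matching prefix first are all assumptions; the stem is required to be a keyword before either the stem or the prefix is "found", following the parenthetical definition of "find" given earlier.

```python
# Small samples of the two prefix lists above (assumption: the full lists
# would be used in practice).
GOOD_PREFIXES = ("bio", "anti", "micro", "hyper")
POOR_PREFIXES = ("un", "re", "pre", "inter")


def split_prefix(word):
    """Return (prefix, stem, is_good) for the longest listed prefix, or None."""
    for prefix in sorted(GOOD_PREFIXES + POOR_PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) > len(prefix):
            return prefix, word[len(prefix):], prefix in GOOD_PREFIXES
    return None


def terms_found(word, vocabulary):
    """List the terms credited to a document for one word (one reading of the rules)."""
    found = [word] if word in vocabulary else []
    parts = split_prefix(word)
    if parts is None:
        return found
    prefix, stem, is_good = parts
    if stem not in vocabulary:  # "find" requires the stem to be a keyword
        return found
    word_found = bool(found)
    if is_good:
        if len(stem) > 3:       # find GOOD prefix if stem > 3 characters
            found.append(prefix)
        if len(stem) > (8 if word_found else 5):
            found.append(stem)  # stem threshold depends on whether the word was found
    elif not word_found and len(stem) > 5:
        found.append(stem)      # POOR prefix: stem only when the word is not found
    return found
```

Under this reading, "microprocessor" with only "processor" in the vocabulary is credited with both the prefix "micro" and the stem "processor", whereas "unhappy", when it is itself a keyword, is credited only as the whole word.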

The next indexing program is AIMPASS2.BAS. It creates Key and Weight files. The nth Rec of Key.Ndx contains NumKeysinDoc(n) followed by up to 127 Key codes which have Weight greater than or equal to the Adaptive Threshold Value. The Adaptive Threshold Value is the average Weight value of the 80th Keyword in each document (0 if there are fewer than 80 Keywords in a document). The nth Rec of Weight.Ndx contains up to 127 (or as many Keywords as are above the Adaptive Threshold Value) Document Weights computed with the following weight formula: ##EQU1##
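As a rough illustration of the Adaptive Threshold Value, the Python sketch below averages the 80th-highest keyword weight over all documents and then selects the keys kept for one record. It assumes the "80th Keyword" means the 80th when a document's keywords are ordered by descending weight, and it takes the weights as given because the weight formula itself is not reproduced in this text. The function names and the dictionary layout are hypothetical.

```python
def adaptive_threshold(doc_weights, rank=80):
    """Average, over all documents, of the weight of the rank-th keyword.

    doc_weights: {document number: {keyword: weight}}  (hypothetical layout)
    A document with fewer than `rank` keywords contributes 0.
    """
    values = []
    for weights in doc_weights.values():
        ordered = sorted(weights.values(), reverse=True)  # assumed ordering
        values.append(ordered[rank - 1] if len(ordered) >= rank else 0.0)
    return sum(values) / len(values) if values else 0.0


def keys_for_record(weights, threshold, limit=127):
    """Keep at most `limit` keywords whose weight is at or above the threshold."""
    kept = [(term, w) for term, w in weights.items() if w >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]
```

The `rank` and `limit` parameters default to the values 80 and 127 stated above; they are exposed only so that the behaviour can be tried on small inputs.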

FREQCOMP.BAS implements the Inverted Index access method along with the weighted values to calculate the frequent companions for each of the words used in the document collection.

For each word ("A") in the controlled vocabulary dictionary, the WEIGHT (see above formula) values for each co-occurring word in the document (a co-occurring word to A is one that appears as a Keyword in the same document that A appears as a Keyword) are summed, along with the WEIGHT values for A in that document, respectively in all documents in which they co-occur. The sum values for each co-occurring word are converted to a percentage, scaled to the sum value for A (i.e., percentage = sum for word's WEIGHT values divided by the sum for A's WEIGHT values). Note that the percentages for the co-occurring words can be higher than 100% if they are heavily weighted in the same records in which A appears. The co-occurring words are then sorted in descending order (from highest percentage value to lowest) and the top 127 are written to a file (see below for structure). If there are 127 co-occurring words or fewer, then all of the co-occurring words will be written in descending sorted order.

    ______________________________________
    Definitions:
      { }    = co-occurring
      weight = WEIGHT value
    Example:
      ##STR1##
      ##STR2##
      ##STR3##
    Resulting File:
    Main Word     Co-Occurring Words . . . (sorted)
    ______________________________________
    A             B 116% . . .
    B             A 63% . . .
    .
    .
    ______________________________________
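The frequent-companion calculation can be sketched in Python as follows. This is an illustration rather than the FREQCOMP.BAS code: the function name and data layout are hypothetical, and the denominator is taken to be the sum of A's weights over all documents containing A, which is the reading given in step (e) of claim 1.

```python
from collections import defaultdict


def frequent_companions(doc_weights, limit=127):
    """Return {A: [(companion, percentage), ...]} sorted by descending percentage.

    doc_weights: {document number: {keyword: weight}}  (hypothetical layout)
    """
    total = defaultdict(float)                         # A -> sum of A's weights
    co_sum = defaultdict(lambda: defaultdict(float))   # A -> B -> sum of B's weights
    for weights in doc_weights.values():
        for a, weight_a in weights.items():
            total[a] += weight_a
            for b, weight_b in weights.items():
                if b != a:
                    co_sum[a][b] += weight_b           # B's weight in a shared document
    companions = {}
    for a, sums in co_sum.items():
        if total[a] <= 0:
            continue                                   # avoid dividing by zero
        ranked = sorted(
            ((b, 100.0 * s / total[a]) for b, s in sums.items()),
            key=lambda pair: pair[1],
            reverse=True,
        )
        companions[a] = ranked[:limit]                 # keep at most the top 127
    return companions
```

Because each percentage scales a companion's summed weight by A's own summed weight, a companion that is weighted more heavily than A in their shared documents can exceed 100%, as noted above.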

After the frequent companions have been found, RELATIVE.BAS is run to define the relatives of each Keyword (A) according to the following algorithm:

Are there any FreqComps for A? If so, then for each FreqComp of A (F):

look for F in A's FreqComp List and get its value

look for the word itself (A) in F's FreqComp List and get its value

apply the formula (Lower×6+Higher)/7, where Lower is the lower of the two values obtained in the above two steps and Higher is the higher of the two values.

Then sort the resulting list of words and values in decreasing order by value

and save the first 63 (or as many as are found) of this list as the relatives for keyword A

For each word (called "A") in the dictionary which has Frequent Companions (not all do, because some words in the dictionary are not used at all in a database), take each Frequent Companion of A (called "F") and its Frequent Companion Percentage Value [FCPVal] in A's Frequent Companion List [FCList] (called "F-VAL") and look for the FCPVal of A in F's FCList (called "A-VAL"). NOTE: If A is not found in F's FCList, then A-VAL is zero (0). The RELATIVE value for F is calculated by multiplying the smaller of F-VAL and A-VAL by 6, adding the larger of F-VAL and A-VAL, and then dividing that sum by 7. If both A and F are in each other's FC lists, the resulting Relative value will be added to both words' Relative lists. If F is in A's FC List, but A is not in F's, then F-VAL will be divided by seven and added only to A's Relative list.

After all the RELATIVE values are calculated for each Frequent Companion (F) in A's FCList, they are sorted in descending order and the top 63 of these words are written to A's Relative List. If there are fewer than 63 Relatives, then all of the Relatives will be written to A's Relative List, in descending order of RELATIVE value.

RELATIVE value = (6 × SmallerPercent Value + LargerPercent Value) / 7

Here the SmallerPercent Value is the smaller of the A-VAL and the F-VAL and the LargerPercent Value is the larger of the A-VAL and the F-VAL.

    ______________________________________
    Sample:
      ##STR4##
    Resulting File:
    Main Word         Relatives . . . (sorted)
    ______________________________________
    A                 B 70 . . .
    B                 A 70 . . .
    .
    .
    ______________________________________
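The RELATIVE value calculation described above is simple enough to sketch directly in Python. The function names and the nested-dictionary layout of the companion lists are hypothetical; the arithmetic follows the stated rule of multiplying the smaller percentage by 6, adding the larger, and dividing by 7.

```python
def relative_value(f_val, a_val):
    """Weighted average that favours the smaller of the two companion percentages."""
    lower, higher = sorted((f_val, a_val))
    return (6 * lower + higher) / 7


def relatives_for(term, companions, limit=63):
    """Rank the relatives of `term`.

    companions: {term: {companion: companion percentage}}  (hypothetical layout)
    A-VAL is taken as 0 when `term` is missing from the companion's own list.
    """
    ranked = []
    for f, f_val in companions.get(term, {}).items():
        a_val = companions.get(f, {}).get(term, 0.0)
        ranked.append((f, relative_value(f_val, a_val)))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:limit]
```

With the companion percentages from the earlier example (B is 116% in A's list and A is 63% in B's list), the formula gives (6 × 63 + 116) / 7 ≈ 70.57, which is consistent with the value 70 shown for both words in the sample file. A one-sided companion with F-VAL 70 and A-VAL 0 gives 10, i.e., F-VAL divided by seven.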

After the relatives have been found, each of the keywords is given a single polysemantic weight that does not change from document to document by the program POLYSEMY.BAS, which uses the following formula: ##EQU3##

Here Avg_n is the average of the relative percentages of the first n relatives of the keyword, TotRelVal is the sum of relative percentages over all relative lists that the keyword is in, and DocFreq is the number of documents that the keyword is in (having a WEIGHT above the adaptive threshold).

Once the indexing programs have been run, the ABSTRACT program is run to create highlights of the full text that will be presented to the user before or in place of the full text itself. First the documents are broken into sentences using a Sentence Ends Algorithm. Then the sentences are assigned weights (values) as a whole and the top ranked sentences are chosen to be part of the highlight. Finally a Sanitize algorithm is used to "X" out (eliminate) proper names from the highlights. See FIG. 12 for specific details on the algorithms used in the ABSTRACT program.

Once the indexing and optionally the ABSTRACT programs have been run, the QSEARCH program can be used to search for documents. This is done by entering a query in natural language. The user program will parse the query to find all the keywords it contains using algorithms similar to those in the AIM program.

After the query is parsed, the user is shown the keywords that are contained in the query in order of their polysemantic weight and is given the opportunity to add and delete words in the query and to have the program suggest SWAPS terms based on the query. These SWAPS terms are generated by generating for each keyword in the vocabulary a summed-relpoly-percentage which is the sum, over all terms that are in the query, of the relpoly percentages of that keyword, where the relpoly percentage is the product of the relative percentage and the polysemantic weight. Then the summed RelPoly percentages are adjusted using a concept called Soft Boolean Connectors to come up with a final SWAPS value for each keyword. The keywords are then ranked by SWAPS value and the highest ranked are presented to the user as suggested SWAPS terms to be added to the query.
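The summed-relpoly-percentage can be illustrated with a short Python sketch. It covers only the summation and ranking, not the Soft Boolean Connectors adjustment. The function name and data layout are hypothetical, and two assumptions are made: the polysemantic weight used is that of the candidate keyword (the reading suggested by step (j) of claim 1), and keywords already in the query are not suggested again.

```python
def suggest_swaps(query_terms, relatives, poly_weight, top=10):
    """Rank candidate SWAPS terms by summed relpoly percentage.

    relatives:   {term: {relative: relative percentage}}  (hypothetical layout)
    poly_weight: {term: polysemantic weight}; missing terms default to 1.0
    """
    in_query = set(query_terms)
    summed = {}
    for query_term in query_terms:
        for candidate, rel_pct in relatives.get(query_term, {}).items():
            if candidate in in_query:
                continue  # assumption: do not suggest a term already in the query
            relpoly = rel_pct * poly_weight.get(candidate, 1.0)
            summed[candidate] = summed.get(candidate, 0.0) + relpoly
    ranked = sorted(summed.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:top]
```

A candidate that is a relative of several query terms accumulates a contribution from each of them, so terms related to the query as a whole tend to rise above terms related to only one query word.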

The Soft Boolean Connectors concept involves penalizing pairs of terms that co-occur often (i.e., in many documents) when calculating the adjustment to be applied to the summed relpoly percentages.

    ______________________________________
    First, multiply the last group of SWAPS words by the
    Boost Factor (=2).
    Then add the relative values of the relatives of the
    main word after each is multiplied by the
    PolyValue of the word.
    (The previous value will be called "Temp Value".)
    Create a table for every pair combination of query words, e.g.,
    for words A, B, & C, there are three pairs:
           AB
           AC
           BC
    ______________________________________

For each pair of query words ("A" & "B"), the Relative Value used in the formula below is B's Relative Value in A's Relative List, or, if B doesn't appear in A's Relative List, then the value is taken from A's Relative Value in B's Relative List (this is possible because the Relative Value between any two words is mutual), i.e., if B is found in A's Relative list, take just that value. You don't need to look at B's list to find A's value there because, if it is there, it would have the same value as B has in A's list. Only if B is not in A's Relative list, check for A in B's list. Enter the Relative Penalty value resulting from the following formula into the table for each combination (pair): ##EQU4##

    ______________________________________
    MAXIMUM PENALTY TABLE (SWAPS)
                          query words    Max.
    ______________________________________
    (for each pair)       2              0.3
                          3              1.0
                          4 & up         0.9
    (for sum of pairs)    2              0.3
                          3              1.4
                          4              1.8
                          5              2.3
                          6 & up         2.8
    ______________________________________
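The pair table, the mutual Relative Value lookup, and the maximum penalties can be sketched in Python as below. The Relative Penalty formula itself is not reproduced in this text, so it is passed in as a hypothetical `penalty_fn` argument. Treating the table entries as caps on each pair's penalty and on the sum of all pair penalties is an assumption about how the maxima are applied, and the function names are illustrative only.

```python
from itertools import combinations

# Maximum penalties for SWAPS, keyed by the number of query words.
PAIR_MAX = {2: 0.3, 3: 1.0}                  # 4 & up -> 0.9
SUM_MAX = {2: 0.3, 3: 1.4, 4: 1.8, 5: 2.3}   # 6 & up -> 2.8


def mutual_relative(a, b, relatives):
    """B's value in A's Relative List, else A's value in B's list, else 0."""
    if b in relatives.get(a, {}):
        return relatives[a][b]
    return relatives.get(b, {}).get(a, 0.0)


def capped_penalties(query_terms, relatives, penalty_fn):
    """Build the pair table and the capped total penalty.

    penalty_fn maps a Relative Value to a penalty; it stands in for the
    formula that is not given here (hypothetical).
    """
    n = len(query_terms)
    pair_cap = PAIR_MAX.get(n, 0.9)
    sum_cap = SUM_MAX.get(n, 2.8)
    table = {}
    for a, b in combinations(query_terms, 2):  # AB, AC, BC, ...
        penalty = penalty_fn(mutual_relative(a, b, relatives))
        table[(a, b)] = min(penalty, pair_cap)
    return table, min(sum(table.values()), sum_cap)
```

For three query words A, B, and C this produces exactly the three pairs AB, AC, and BC listed above, with each entry limited to 1.0 and the total limited to 1.4.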

After the user has modified the query by choosing SWAPS terms, he can have the program suggest new SWAPS terms based on the new query. In this case the program boosts the relative percentages of the last chosen set of SWAPS terms before calculating summed relpoly percentages. This allows the user to navigate in the data base by modifying his query so that it will find documents containing the SWAPS terms.

For example, FIG. 2 shows the options the user will be presented with after entering the query "when can a contract be enforced". If the user chooses the menu option "Related Terms" he will be shown a list of SWAPS terms as shown in FIG. 3. This first set of SWAPS terms that are presented to the user includes the term "statutes". The user may choose one or more of these suggested SWAPS terms to add to the query. In FIG. 4 we see that the user has chosen to add the term "statutes" to the query. At this point the user can again ask the system to suggest SWAPS words. This time the previously added SWAPS term "statutes" will be given extra weight in determining which new terms are suggested to the user. In FIG. 5 we see the resulting suggested SWAPS terms generated from the four query terms "agreement", "statutes", "enforcement", and "can", with "statutes" given more weight than the other three terms. Notice that the SWAPS words are ranked somewhat differently than in FIG. 3 and in particular a new SWAPS term "statute of limitations" is suggested. By adding the term "statutes" to the query and then asking again for suggested SWAPS terms the user has "moved" the query to "an area of the database" that contains documents dealing with "statute of limitations", which is a term of art that makes the original query more focused and is likely to find documents that are relevant to the intent of the original query. Here the fact that both terms "statutes" and "statute of limitations" contain the same word is fortuitous. It is the meaning of the term "statutes" which makes it a close relative of "statute of limitations" by virtue of the fact that these two terms co-occur in many of the same documents.

Once the user is satisfied with his query he asks the program to retrieve documents that are relevant to the query. In FIG. 6 he would choose the View Documents option. The system will then use its index files to assign a value to each document and then rank the documents. The documents are ranked by generating for each document a summed-weightpoly-value which is the sum, over all terms that are in the query, of the weightpoly values of that keyword, where the weightpoly value is the product of the weight of the keyword in that document and its polysemantic weight. Then the summed-weightpoly values are adjusted using the Soft Boolean Connectors concept to come up with a final value for each document. The documents are then ranked by value and presented to the user in order of rank.
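The summed-weightpoly-value can be sketched in Python in the same spirit. The sketch covers only the sum of weight times polysemantic weight and the final sort; the Soft Boolean Connectors adjustment and the Boost Factor are left out. The function name and the dictionary layout are hypothetical, and documents containing none of the query terms are simply omitted from the result.

```python
def rank_documents(query_terms, doc_weights, poly_weight):
    """Rank documents by summed weightpoly value, highest first.

    doc_weights: {document number: {keyword: weight in that document}}
    poly_weight: {keyword: polysemantic weight}; missing terms default to 1.0
    (both layouts are assumptions made for illustration)
    """
    scores = {}
    for doc_number, weights in doc_weights.items():
        score = sum(
            weights[term] * poly_weight.get(term, 1.0)
            for term in query_terms
            if term in weights
        )
        if score > 0:
            scores[doc_number] = score  # keep only documents with at least one hit
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
```

A document that contains several query terms, or contains a query term with a high weight and a high polysemantic weight, therefore receives a larger value and appears earlier in the ranked list.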

The Soft Boolean Connectors concept involves penalizing pairs of terms that co-occur often (i.e., in many documents) when calculating the adjustment to be applied to the summed-weightpoly values. First, multiply the original query words by the Boost Factor (=2). Then add the WEIGHT values of the key words in a document after each is multiplied by the PolyValue of the word. (The previous value will be called "Temp Value".)

    ______________________________________
    Create a table for every pair combination of query words (A B C):
           AB
           AC
           BC
    ______________________________________

For each pair of query words ("A" & "B"), the Relative Value used in the formula below is B's Relative Value in A's Relative List, or, if B doesn't appear in A's Relative List, then the value is taken from A's Relative Value in B's Relative List (this is possible because the Relative Value between any two words is mutual), i.e., if B is found in A's Relative list, take just that value. You don't need to look at B's list to find A's value there because, if it is there, it would have the same value as B has in A's list. Only if B is not in A's Relative list, check for A in B's list. Enter the Relative Penalty value resulting from the following formula into the table for each combination. ##EQU5##

    ______________________________________
    MAXIMUM PENALTY TABLE (RANKING)
                          query words    Max.
    ______________________________________
    (for each pair)       2              0.5
                          3              1.3
                          4              1.2
                          5 & up         1.1
    (for sum of pairs)    2              0.5
                          3              1.6
                          4              1.9
                          5              2.3
                          6 & up         2.8
    ______________________________________

To facilitate very rapid perusal of the ranked documents, the document values (used in the ranking) are presented as a bar graph as shown in FIG. 7. Also the documents are presented in 3 forms. The first form consists of a ranked array of the highest ranked terms in the document that requires only about 1/3 of the display screen (FIG. 7). The second form consists of a program generated "highlight" of the document which consists of very short portions of the document of less than a dozen words that contain the highest ranked terms. This highlight scrolls in about 2/3 of the screen and is shown along with the array of highest ranked terms. The third form consists of the full text of the document which can be scrolled. The user can use arrow keys to move rapidly from one document to the next.

Appendix 1 contains the full BASIC program source code that implements the preferred embodiment described above. This code must be compiled using the Microsoft 7.1 BASIC compiler to produce object modules which must then be linked along with libraries containing object code for assembler routines from the Crescent Software QuickPak Professional Advanced Programming Library for BASIC Compilers Version 4.12 to produce an executable file.

There has thus been shown and described a novel document indexing and retrieval system which fulfills all the objects and advantages sought therefor. Many changes, modifications, variations and other uses and applications of the subject invention will, however, become apparent to those skilled in the art after considering this specification and the accompanying drawings which disclose the preferred embodiments therefor. All such changes, modifications, variations and other uses and applications which do not depart from the spirit and scope of the invention are deemed to be covered by the invention which is limited only by the claims which follow. ##SPC1##

What is claimed is:
 1. A method of indexing and retrieving documents, said method using a digital computer system having a central processing unit, a memory, a display screen, a keyboard, and a large capacity file system, said method comprising the steps of:
(a) storing in said memory a vocabulary of terms, each term consisting of one or more words, and for each term an associated term-code;
(b) storing on said file system a collection of documents each with an associated unique document-number;
(c) creating index files which contain for each said term-code in (a) (i) the set of document-numbers in (b) such that the corresponding documents contain the corresponding term; and (ii) for each said document-identifying-number in (i) the frequency-in-document of the corresponding term which is the number of times that said term appears in the corresponding document;
(d) creating a weight-in-document file which contains for each document-number in (c)(i) the weight-in-document of the corresponding term which is calculated using the frequency-in-document in (c)(ii), the number of document-numbers in (c)(i), and the total number of terms in (a) which are in the corresponding document (counted multiple times);
(e) creating a frequent-companion file which contains for each occurring term-code in (a) a ranked set of pairs of numbers where each pair consists of a first element term-code and a second element companion-percentage, where the companion-percentage is calculated by summing the weight-in-document values of said first element term-code over documents that contain both the term corresponding to said first element term-code and the term corresponding to said occurring term-code and then dividing by the sum over all documents of the weight-in-document of said occurring term-code;
(f) creating a relative file which contains for each occurring term-code in (a) a ranked set of pairs of numbers where each pair consists of a first element relative term-code and a second element relative-percentage, where the relative-percentage is calculated by taking a weighted average of the companion-percentage of said first element term-code calculated in step (e) and the companion-percentage of said occurring term-code that was calculated in step (e) when said first element term-code was the occurring term-code and said occurring term-code was the first element term-code;
(g) creating a polysemantic file which contains for each occurring term-code in (a), a polysemantic weight which is calculated using the number of sets of pairs in the relative file created in step (f) that said occurring term-code appears in, the number of document-numbers for which the weight-in-document of said occurring term-code calculated in step (d) is greater than some threshold value, and the averages for several values of N of the first N relative-percentages of said occurring term-code calculated and ranked in step (f);
(h) accepting a query consisting of a sequence of words entered by a user using said keyboard and creating a parsed-query table of term-codes which consist of the term-codes in said vocabulary that are associated with the terms that are contained in said query;
(i) creating a temporary swap table of pairs of first element term-codes and corresponding second element summed-relative-percentages consisting of those relative term-codes created in step (f) where said corresponding second element summed-relative-percentages are the sum, over all said occurring term-codes that are in said parsed-query table, of the relative percentages of said first element term-codes;
(j) creating a modified swap table by modifying said second element summed-relative-percentages created in step (i) by multiplying them by a function of the polysemantic weight of the corresponding first element term-codes;
(k) sorting said modified swap table by said modified summed-relative-percentages in descending order;
(l) displaying on said display the terms corresponding to the term-codes of said modified swap table;
(m) accepting user keypresses or other actions which identify one or more of the terms displayed in step (l) and adding the corresponding term-codes to the parsed-query table;
(n) repeating steps (i) through (m) as many times as the user indicates by his input;
(o) accepting an input from the user indicating a command to retrieve documents;
(p) creating a temporary rank table of pairs of first element document-numbers and corresponding second element summed-document-weight×poly values which pairs comprise those document-numbers for which any of the term-codes that are in said parsed-query table have weight-in-document above a threshold value, and summed-document-weight×poly values which are the sums, over all term-codes in said parsed-query table, of a function of the polysemantic weight of the term-code and the weight-in-document of the term-code;
(r) creating a sorted rank table by sorting said temporary rank table by the value of the second elements of the pairs in descending order;
(s) displaying on the display screen some portion of the document corresponding to the first document number in the sorted rank table and some indication of the corresponding summed-document-weight×poly value;
(t) displaying other documents corresponding to other document-numbers in the sorted rank table in response to inputs from the user.
 2. A method as in claim 1 wherein additional steps (j)(1) and (p)(1) are carried out after steps (j) and (p) respectively to implement the soft boolean connector algorithm which consists of the following steps:
(A) creating a table of relative penalties for each pair of said term-codes in said parsed-query table where said relative penalty is a function of the relative percentage corresponding to the two term-codes of said pair, the number of documents that each of the term-codes of the pair are contained in with a document-weight above a threshold, and the average over all terms of the number of documents that the term is contained in with a document-weight above said threshold;
(B) modifying said relative penalties by taking the minimum of the relative penalty and some maximum value which depends on the number of terms in the parsed-query table;
(C) summing said modified relative penalties to produce a sum of relative penalties;
(D) modifying said sum of relative penalties by taking the minimum of said sum and some maximum sum value which depends on the number of terms in the parsed-query table to produce a modified sum of penalties;
(E) summing some function of the polysemantic weights of the term-codes in the parsed-query table that are either relatives of a potential SWAPS term (j1) or are contained in a document (p1) to produce a number of hits value;
(F) calculating some function of the number of hits value and the modified sum of penalties value to produce a power value;
(G) raising a number approximately equal to 2 to the power value to produce an adjust value;
(H) multiplying either the modified summed relative percentages calculated in step (j) or the summed document weight×poly values calculated in step (p) by the adjust value.
 3. A method as in claim 1 where the formula for calculating the weight-in-document in step (d) is: ##EQU6##
 4. A method as in claim 1 where the formula for calculating the polysemantic weight in step (g) is: ##EQU7##
 5. A method as in claim 1 where the function in step (j) is the identity function.
 6. A method as in claim 1 where the function in step (p) is the identity function.
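
For illustration only, steps (e), (f), (i) through (k), (p) and (r) of claim 1 can be sketched in Python as shown below. The sketch is not part of the claims and is not the claimed implementation. It rests on several assumptions: the weight-in-document and polysemantic weight values are invented, because the formulas of claims 3 and 4 are not reproduced here; the weighted average of step (f) is assumed to use equal weights (alpha = 0.5); the function of steps (j) and (p) is taken to be the identity as in claims 5 and 6, and in step (p) it is assumed to be multiplied by the weight-in-document; and terms already in the query are assumed to be left out of the swap table. All names (WEIGHTS, POLY, companion_percentage, relative_percentage, swap_table, rank_table) are hypothetical.

```python
from collections import defaultdict

# Hypothetical weight-in-document values, WEIGHTS[term][document_number].
# Invented for this sketch; the formula of claim 3 is not used.
WEIGHTS = {
    "engine": {1: 0.5, 2: 0.5},
    "motor": {1: 0.25, 3: 0.75},
    "piston": {2: 0.5, 4: 0.5},
}
# Hypothetical polysemantic weights; the formula of claim 4 is not used.
POLY = {"engine": 1.0, "motor": 0.5, "piston": 1.0}


def companion_percentage(first, occurring, weights=WEIGHTS):
    """Step (e): weight of `first` summed over documents containing both
    terms, divided by the total weight-in-document of `occurring`."""
    shared = weights[first].keys() & weights[occurring].keys()
    return sum(weights[first][d] for d in shared) / sum(weights[occurring].values())


def relative_percentage(first, occurring, alpha=0.5, weights=WEIGHTS):
    """Step (f): weighted average of the two directed companion
    percentages. alpha = 0.5 is an assumed weighting."""
    return (alpha * companion_percentage(first, occurring, weights)
            + (1 - alpha) * companion_percentage(occurring, first, weights))


def swap_table(query, weights=WEIGHTS, poly=POLY, f=lambda p: p):
    """Steps (i)-(k): sum relative percentages over the query terms,
    scale by f(polysemantic weight), and sort in descending order."""
    summed = defaultdict(float)
    for occurring in query:
        for first in weights:
            if first in query:  # assumption: do not suggest query terms
                continue
            r = relative_percentage(first, occurring, weights=weights)
            if r > 0:
                summed[first] += r
    modified = {t: s * f(poly[t]) for t, s in summed.items()}
    return sorted(modified.items(), key=lambda kv: (-kv[1], kv[0]))


def rank_table(query, threshold=0.0, weights=WEIGHTS, poly=POLY, f=lambda p: p):
    """Steps (p) and (r): keep documents in which any query term exceeds
    the threshold, score each by the sum of f(poly) * weight-in-document
    over all query terms, and sort in descending order."""
    hits = {d for t in query for d, w in weights[t].items() if w > threshold}
    scores = {d: sum(f(poly[t]) * weights[t].get(d, 0.0) for t in query)
              for d in hits}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


print(swap_table(["engine"]))            # [('piston', 0.5), ('motor', 0.1875)]
print(rank_table(["engine", "piston"]))  # [(2, 1.0), (1, 0.5), (4, 0.5)]
```

With this invented data, the query term "engine" suggests "piston" ahead of "motor", and adding "piston" to the query ranks document 2, which contains both terms, first. Ties are broken by term or document number only to keep the output deterministic; the claims do not specify a tie-breaking rule.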